Cancerous Cells Classification - Neural Network
Introduction
Breast cancer is “the most common cause of cancer deaths among women worldwide”. In the United States it is second only to lung cancer in cancer-related deaths, making it a critical national health issue. Statistics show breast cancer as the most frequently diagnosed cancer among women in 140 of 184 countries, and survival and remission are closely linked to early detection and intervention (Henderson 2015).
Early signs of irregular cell growth are detected by sampling and analyzing nuclear changes and parameters using diagnostic tools. The results of these nuclear morphometry tests are evaluated for structural deviations that are indicative of cancer (Narasimha, Vasavi, and Harendra Kumar 2013). Given how much accuracy matters in these evaluations, one may ask how the medical industry uses machine learning models such as neural networks to augment the diagnosis and judgment behind such vital medical assessments.
The following is a post-study report on numerous breast cancer preventive screenings, scrutinizing cell nuclei parameters in order to classify each specimen as either malignant or benign. The data have been previously categorized; the intent here is to employ a neural network to replicate that categorization and measure the algorithm’s effectiveness in supporting medical professionals in the identification and early detection of breast carcinoma.
Data
The employed data set, Breast Cancer Wisconsin - Diagnostic, comes from the UCI Machine Learning Repository. The multivariate set contains 569 instances and 32 attributes, described below. As previously stated, the data set features quantitative observations derived from digitized images obtained by means of a fine needle aspirate (FNA). These digitized samples were studied, measured, and recorded, ultimately enabling the classification of every instance as malignant or benign. There are 10 real-valued features per cell nucleus; because each feature is recorded across three planes, this yields 30 observations per instance (William Wolberg 1993).
- Variables list across each plane
1 - ID Name:: Identification number of the sample
2 - Diagnosis:: Dependent variable label. M = Malignant, B = Benign
3 - Feature_1 Radius:: Mean of distances from center to points on the perimeter
4 - Feature_2 Texture:: Standard deviation of gray-scale values
5 - Feature_3 Perimeter::
6 - Feature_4 Area::
7 - Feature_5 Smoothness:: Local variation in radius lengths
8 - Feature_6 Compactness:: Perimeter^2 / Area - 1.0
9 - Feature_7 Concavity:: Severity of concave portions of the contour
10 - Feature_8 Concave Points:: Number of concave portions of the contour
11 - Feature_9 Symmetry::
12 - Feature_10 Fractal Dimension:: Coastline approximation - 1
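The data set is not bundled with R. A minimal sketch of loading it from the UCI repository and applying the naming scheme above might look like the following; the URL and the `Feature_1`..`Feature_30` column names are assumptions matching this report, not part of the original analysis:

```r
# sketch: load the WDBC data and apply the naming used in this report
# (URL and column naming are assumptions; point to a local copy if preferred)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
raw <- read.csv(url, header = FALSE)

# column 1 = ID, column 2 = Diagnosis (M/B), columns 3:32 = the 30 features
colnames(raw) <- c("ID", "Diagnosis", paste0("Feature_", 1:30))

# drop the ID column, which carries no diagnostic information
df <- raw[, -1]
str(df)
```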
Exploratory Data Analysis
Basic descriptive statistics of the data set.
# sample of statistical summary - Plane 1
describe(df[1:10], interp = TRUE, ranges = FALSE)
           vars   n   mean     sd skew kurtosis    se
Diagnosis* 1 569 1.37 0.48 0.53 -1.73 0.02
Feature_1 2 569 14.13 3.52 0.94 0.81 0.15
Feature_2 3 569 19.29 4.30 0.65 0.73 0.18
Feature_3 4 569 91.97 24.30 0.99 0.94 1.02
Feature_4 5 569 654.89 351.91 1.64 3.59 14.75
Feature_5 6 569 0.10 0.01 0.45 0.82 0.00
Feature_6 7 569 0.10 0.05 1.18 1.61 0.00
Feature_7 8 569 0.09 0.08 1.39 1.95 0.00
Feature_8 9 569 0.05 0.04 1.17 1.03 0.00
Feature_9 10 569 0.18 0.03 0.72 1.25 0.00
Density exploration of diagnosis across four specific variables: radius, texture, area, and concavity.
# Explore density
g1 <- ggplot(data = df)+
theme_minimal()+
geom_density(mapping = aes(Feature_1, fill = Diagnosis), col="darkgrey", show.legend = FALSE, alpha=0.5)+
labs(title = "Density of Diagnosis per Attribute | Red = Benign, Blue = Malignant", y = " ", x = "Radius")
g2 <- ggplot(data = df)+
theme_minimal()+
geom_density(mapping = aes(Feature_2, fill = Diagnosis), col="darkgrey", show.legend = FALSE, alpha=0.5)+
labs(title = "", y = " ", x = "Texture")
g3 <- ggplot(data = df)+
theme_minimal()+
geom_density(mapping = aes(Feature_4, fill = Diagnosis), col="darkgrey", show.legend = FALSE, alpha=0.5)+
labs(title = "", y = " ", x = "Area")
g4 <- ggplot(data = df)+
theme_minimal()+
geom_density(mapping = aes(Feature_7, fill = Diagnosis), col="darkgrey", show.legend = FALSE, alpha=0.5)+
labs(title = "", y = " ", x = "Concavity")
grid.arrange(arrangeGrob(g1, g2, g3, g4), nrow=1)
Key insight: these density plots are a useful way to illustrate the distribution of continuous data points per class. The selected variables are just examples of what can be plotted quickly to identify potential relationships between a feature and the dependent variable.
Pre-Visualization of Data
Here I use a conditional inference tree to estimate relationships across the data and show how it is recursively partitioned by the algorithm’s criteria. The main intent is to form an expectation of how this data set will be classified.
# mutate diagnosis character to numeric
df <- df |>
mutate(Diagnosis = ifelse(Diagnosis == "M", 1, 0))
# CTREE model and plot
model <- ctree(Diagnosis ~., data = df)
plot(model, type="extended", ep_args = list(justmin=8),
main="Breast Cancer | Preliminary Analysis",
drop_terminal=FALSE, tnex=1.5,
gp = gpar(fontsize = 12, col="darkblue"),
inner_panel = node_inner(model, fill=c("white","green"), pval=TRUE),
terminal_panel=node_barplot(model, fill=rev(c("darkred","lightgrey")), beside=TRUE, ymax=1.0,
just = c(0.95,0.5), ylines=TRUE, widths = 1.0, gap=0.05,
reverse=FALSE, id=TRUE)
)
Scatterplot Matrix (SPLOM)
This scatterplot matrix portrays immediate correlations between the included variables and possible multicollinearity among them. The selected features were limited to the first plane only (items 1 to 10); the other two planes include similar features. My immediate take was to consider a method for dimensionality reduction, even with a neural network classification model in mind.
# Plot Plane I
clrs <- c("darkred", "lightgrey")
pairs(df[1:11], main = "Plane I - Matrix of Scatterplots",
      cex.main = 2.0, cex.labels = 1.0, lower.panel = NULL, pch = 21,
      col = "grey", bg = clrs[df$Diagnosis + 1]) # Diagnosis is 0/1 after the earlier mutate
par(xpd = TRUE)
legend(0.10, 0.01, horiz = TRUE, legend = as.vector(unique(df$Diagnosis)), fill = clrs, bty = "n")
Dimensionality Reduction
As defined, the purpose of dimensionality reduction is to find a method that can represent a given data set using a smaller number of features while still preserving the original data’s properties. There are different methods to accomplish this (LDA, PCA, t-SNE, UMAP, etc.), but I ultimately decided to use PCA for feature extraction. The following steps illustrate how the method identifies the eigenvectors with the largest eigenvalues of the covariance matrix. As a result, I create a subset of the data using the variables with maximum influence on variance.
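To make the eigenvector step concrete, here is a small self-contained sketch (on synthetic data, not the study data) showing that the principal components returned by `prcomp` are exactly the eigen-decomposition of the covariance matrix, with component variances equal to the eigenvalues:

```r
set.seed(1)
# synthetic stand-in for a scaled feature matrix
X <- scale(matrix(rnorm(200), ncol = 4))

# eigen-decomposition of the covariance matrix
ev <- eigen(cov(X))

# prcomp performs the equivalent decomposition via SVD
pc <- prcomp(X)

# PC variances == eigenvalues of the covariance matrix, largest first
all.equal(ev$values, pc$sdev^2)  # TRUE
```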
# pca of original data
res.pca <- PCA(df, scale.unit = TRUE, graph = FALSE, ncp = 4)
eig.val <- get_eigenvalue(res.pca)
var <- get_pca_var(res.pca)
# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
Sub-selected Features
Here is my source code to create a subset based on the PCA, followed by conditioning the target variable as a factor for down-sampling purposes. The down-sampling approach ensures an equal number of each target outcome, avoiding any model bias toward one class or the other. With down-sampling completed, I scale and center the data to maximize model performance.
# selection of principal components
pdf <- df |>
select(Diagnosis, Feature_1, Feature_2, Feature_3, Feature_4, Feature_6, Feature_7, Feature_8,
Feature_10, Feature_11, Feature_13, Feature_14, Feature_16, Feature_18, Feature_20,
Feature_21, Feature_23, Feature_24, Feature_26, Feature_27, Feature_28,
Feature_30
)
# mutating as a factor for downsampling
pdf <- pdf |>
dplyr::mutate(Diagnosis = as.factor(Diagnosis))
# class definition
target <- "Diagnosis"
# downsampling
set.seed(12345)
downsampled_df <- downSample(x = pdf[, colnames(pdf) != target], y = pdf[[target]])
downsampled_df <- cbind(downsampled_df, downsampled_df$Class)
colnames(downsampled_df)[ncol(downsampled_df)] <- target
# subset, mutate, and scale
df2 <- pdf
df2 <- df2 |>
mutate(Diagnosis = as.character(Diagnosis),
Diagnosis = as.numeric(Diagnosis))
df2[, -1] <- scale(df2[, -1])
Final summary statistics across the data set.
| Name | df2 |
| Number of rows | 569 |
| Number of columns | 22 |
| _______________________ | |
| Column type frequency: | |
| numeric | 22 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Diagnosis | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| Feature_1 | 0 | 1 | 0.00 | 1.00 | -2.03 | -0.69 | -0.21 | 0.47 | 3.97 | ▂▇▃▁▁ |
| Feature_2 | 0 | 1 | 0.00 | 1.00 | -2.23 | -0.73 | -0.10 | 0.58 | 4.65 | ▃▇▃▁▁ |
| Feature_3 | 0 | 1 | 0.00 | 1.00 | -1.98 | -0.69 | -0.24 | 0.50 | 3.97 | ▃▇▃▁▁ |
| Feature_4 | 0 | 1 | 0.00 | 1.00 | -1.45 | -0.67 | -0.29 | 0.36 | 5.25 | ▇▃▂▁▁ |
| Feature_6 | 0 | 1 | 0.00 | 1.00 | -1.61 | -0.75 | -0.22 | 0.49 | 4.56 | ▇▇▂▁▁ |
| Feature_7 | 0 | 1 | 0.00 | 1.00 | -1.11 | -0.74 | -0.34 | 0.53 | 4.24 | ▇▃▂▁▁ |
| Feature_8 | 0 | 1 | 0.00 | 1.00 | -1.26 | -0.74 | -0.40 | 0.65 | 3.92 | ▇▃▂▁▁ |
| Feature_10 | 0 | 1 | 0.00 | 1.00 | -1.82 | -0.72 | -0.18 | 0.47 | 4.91 | ▆▇▂▁▁ |
| Feature_11 | 0 | 1 | 0.00 | 1.00 | -1.06 | -0.62 | -0.29 | 0.27 | 8.90 | ▇▁▁▁▁ |
| Feature_13 | 0 | 1 | 0.00 | 1.00 | -1.04 | -0.62 | -0.29 | 0.24 | 9.45 | ▇▁▁▁▁ |
| Feature_14 | 0 | 1 | 0.00 | 1.00 | -0.74 | -0.49 | -0.35 | 0.11 | 11.03 | ▇▁▁▁▁ |
| Feature_16 | 0 | 1 | 0.00 | 1.00 | -1.30 | -0.69 | -0.28 | 0.39 | 6.14 | ▇▃▁▁▁ |
| Feature_18 | 0 | 1 | 0.00 | 1.00 | -1.91 | -0.67 | -0.14 | 0.47 | 6.64 | ▇▇▁▁▁ |
| Feature_20 | 0 | 1 | 0.00 | 1.00 | -1.10 | -0.58 | -0.23 | 0.29 | 9.84 | ▇▁▁▁▁ |
| Feature_21 | 0 | 1 | 0.00 | 1.00 | -1.73 | -0.67 | -0.27 | 0.52 | 4.09 | ▆▇▃▁▁ |
| Feature_23 | 0 | 1 | 0.00 | 1.00 | -1.69 | -0.69 | -0.29 | 0.54 | 4.28 | ▇▇▃▁▁ |
| Feature_24 | 0 | 1 | 0.00 | 1.00 | -1.22 | -0.64 | -0.34 | 0.36 | 5.92 | ▇▂▁▁▁ |
| Feature_26 | 0 | 1 | 0.00 | 1.00 | -1.44 | -0.68 | -0.27 | 0.54 | 5.11 | ▇▅▁▁▁ |
| Feature_27 | 0 | 1 | 0.00 | 1.00 | -1.30 | -0.76 | -0.22 | 0.53 | 4.70 | ▇▅▂▁▁ |
| Feature_28 | 0 | 1 | 0.00 | 1.00 | -1.74 | -0.76 | -0.22 | 0.71 | 2.68 | ▅▇▅▃▁ |
| Feature_30 | 0 | 1 | 0.00 | 1.00 | -1.60 | -0.69 | -0.22 | 0.45 | 6.84 | ▇▃▁▁▁ |
Visualization of variables as defined by the PCA across the first and second dimensions.
# pca of subset data
res.pca <- PCA(df2, scale.unit = FALSE, graph = FALSE, ncp = 4)
eig.val <- get_eigenvalue(res.pca)
var <- get_pca_var(res.pca)
# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
Data Partitioning
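The `trainData`, `testData`, and `xvalData` objects used below are created in this step. A minimal sketch of one way to produce them; the roughly 45/27.5/27.5 proportions are an assumption inferred from the confusion-matrix counts later in the report (255/160/154 of 569), and the seed is mine:

```r
# sketch: split the scaled data into training, testing, and cross-validation sets
# (proportions and seed are assumptions, not from the original analysis)
set.seed(12345)
n <- nrow(df2)
idx <- sample(seq_len(n))        # shuffle row indices
cut1 <- floor(0.45 * n)          # ~45% training
cut2 <- floor(0.725 * n)         # ~27.5% testing, remainder cross-validation

trainData <- df2[idx[1:cut1], ]
testData  <- df2[idx[(cut1 + 1):cut2], ]
xvalData  <- df2[idx[(cut2 + 1):n], ]
```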
Model Training and Visualization
The neural network is trained with the R package neuralnet. This powerful algorithm “is based on the resilient backpropagation without weight backtracking and additionally modifies one learning rate, either the learningrate associated with the smallest absolute gradient (sag) or the smallest learningrate (slr) itself. The learning rates in the grprop algorithm are limited to the boundaries defined in learningrate.limit.” (Fritsch, Guenther, and Wright 2019)
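To illustrate the resilient-backpropagation idea the quote describes, here is a toy one-dimensional sketch of sign-based step-size adaptation: the step grows while the gradient sign persists and shrinks on a sign flip. This is a simplification for intuition, not the neuralnet internals:

```r
# Rprop-style update on a quadratic (simplified; no weight backtracking)
f_grad <- function(w) 2 * (w - 3)  # gradient of (w - 3)^2, minimum at w = 3
w <- 0; step <- 0.1; prev_g <- 0
for (i in 1:50) {
  g <- f_grad(w)
  if (g * prev_g > 0) step <- min(step * 1.2, 50)    # same sign: accelerate
  if (g * prev_g < 0) step <- max(step * 0.5, 1e-6)  # sign flip: decelerate
  w <- w - sign(g) * step                            # move by step size, not gradient magnitude
  prev_g <- g
}
w  # approaches the minimum at w = 3
```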
# neuralnet
nn <- neuralnet(Diagnosis ~., data = trainData, hidden = c(3),
lifesign = "minimal", linear.output = FALSE, likelihood=TRUE
# act.fct = "tanh", err.fct = "sse"
)
# plot model
plot(nn, radius = 0.03, arrow.length = 0.16, intercept = TRUE,
intercept.factor = 0.2, information = TRUE, information.pos = 8,
col.entry.synapse = "black", col.entry = "maroon4", line_stag= 0.1,
col.hidden = "darkblue", col.hidden.synapse = "dimgrey",
col.out = "green", col.out.synapse = "blue",
col.intercept = "red", fontsize = 9, dimension = 2,
show.weights = TRUE, rep = "best")
Model’s Evaluation
Observation & Notes | Section in-progress …
- Approach
- Training
- Testing
- Cross-validation
# model evaluation
mypredict <- neuralnet::compute(nn, nn$covariate)$net.result
mypredict <- apply(mypredict, c(1), round)
# confusion matrix - training set
print(table(mypredict[1:length(trainData$Diagnosis)], trainData$Diagnosis, dnn = c("Predicted", "Actual")))
         Actual
Predicted   0   1
        0 158   0
        1   0  97
# accuracy(trainData$Diagnosis, mypredict[1:length(trainData$Diagnosis)]) # cross-entropy
# model evaluation
testPred <- neuralnet::compute(nn, testData[,1:22])$net.result
testPred <- apply(testPred, c(1), round)
# confusion matrix - test set
print(table(testPred[1:length(testData$Diagnosis)], testData$Diagnosis, dnn = c("Predicted", "Actual")))
         Actual
Predicted  0  1
        0 97  5
        1  2 56
# accuracy(testPred[1:length(testData$Diagnosis)], testData$Diagnosis) # cross-entropy
# model evaluation
xvalPred <- neuralnet::compute(nn, xvalData[,1:22])$net.result
xvalPred <- apply(xvalPred, c(1), round)
# confusion matrix - xval set
print(table(xvalPred[1:length(xvalData$Diagnosis)], xvalData$Diagnosis, dnn = c("Predicted", "Actual")))
         Actual
Predicted  0  1
        0 97  0
        1  3 54
# accuracy(xvalPred[1:length(xvalData$Diagnosis)], xvalData$Diagnosis) # cross-entropy
Mean Scores Comparison
Observation & Notes | Section in-progress …
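While this section is in progress, here is a minimal sketch of the comparison it could hold: accuracy, sensitivity, and specificity computed from each 2x2 confusion matrix. The `scores` helper is mine; the matrices are copied from the tables above, treating the first `table()` dimension as predictions, which is what the `table(pred, actual)` calls produce:

```r
# sketch: summary scores from a 2x2 confusion matrix (rows = predicted, cols = actual)
scores <- function(cm) {
  tn <- cm[1, 1]; fn <- cm[1, 2]
  fp <- cm[2, 1]; tp <- cm[2, 2]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # true-positive rate
    specificity = tn / (tn + fp))   # true-negative rate
}

# confusion matrices reported above (training, test, cross-validation)
train_cm <- matrix(c(158, 0, 0, 97), 2, byrow = TRUE)
test_cm  <- matrix(c(97, 5, 2, 56), 2, byrow = TRUE)
xval_cm  <- matrix(c(97, 0, 3, 54), 2, byrow = TRUE)

sapply(list(train = train_cm, test = test_cm, xval = xval_cm), scores)
```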
ROC Curve
Observation & Notes | Section in-progress …
# ROC curve analysis
pred <- neuralnet::compute(nn, xvalData[,1:22])$net.result # score the cross-validation set, not the training covariates
predObj <- prediction(pred, xvalData$Diagnosis)
rocObj <- performance(predObj, measure="tpr", x.measure="fpr")
aucObj <- performance(predObj, measure = "auc")
plot(rocObj, main = "ROC Curve", cex.lab=1.25, cex.main = 1.5, col = "blue")
text(.75, .25, paste("Area under the curve:", round(aucObj@y.values[[1]], 4)),
col = "darkred", cex = 1.25)